Method for analyzing sets of temporal data

ABSTRACT

A method for analyzing sets of temporal data using a computer wherein each set of temporal data includes a plurality of records collected at a time unique to each such set and in which each record has a plurality of data items. The method includes the first step of creating data association rules for at least a plurality of sequential sets wherein each association rule represents data records having at least some common data items (100). A confidence factor is then determined for each such association rule and these confidence factors are stored in data partitions for the temporal data sets (102). The confidence factors for a selected data partition are then compared with the corresponding confidence factors of at least one other data partition (112), if available. When the amount by which the confidence factor for the selected data partition varies from the corresponding confidence factor for the at least one other data partition exceeds a threshold value, an alert output signal is generated (114).

BACKGROUND OF THE INVENTION

I. Field of the Invention

The present invention relates generally to a data analysis computer program and, more particularly, to a data analysis program for analyzing sets of temporal data such as temporal health care surveillance data, and especially epidemiological data.

II. Description of the Prior Art

There are many health care databases, e.g. epidemiology databases, containing temporal data, i.e. data which is collected at periodic time intervals. Such databases, furthermore, typically include bacterial antimicrobial data, resistance data and the like at hospital, regional and national levels. Domain experts in epidemiology and laboratory medicine currently review the antimicrobial susceptibility data at half year, yearly or even longer intervals in an effort to discover significant new patterns, information and trends of the data. This time deferred and late discovery of such trends results in increased inefficiency and increased cost of treatment in the medical field.

Additionally, at present domain experts perform only manual analysis of the data in an effort to discover trends and patterns of health care or epidemiological data. Such manual analysis includes database queries and confirmatory statistics to specific questions in an effort to test specific hypotheses. These traditional methods of data analysis, however, offer no way to discover patterns and trends that are not suspected by the investigators of the data. Consequently, such unsuspected trends and patterns are simply ignored and remain undiscovered even though such trends and patterns may be significant.

SUMMARY OF THE PRESENT INVENTION

The present invention provides a method for analyzing sets of temporal data, especially epidemiological data, which automatically identifies significant trends and patterns in the data and does so in a timely fashion.

In brief, the method of the present invention analyzes sets of temporal data wherein each set of temporal data comprises a plurality of records collected at a time period unique to each such set. Each record has a plurality of data items including, for example, patient characteristics, the organism isolated, source of the sample, date reported, location of patient and one or more antimicrobials used to test the sample against.

The method of the present invention includes the first step of creating data association rules for at least a plurality of sequential data sets, i.e. sequential temporal data sets, wherein each such data set includes at least some common data items. Each association rule is only considered if it has precondition support in some predetermined number of records. Otherwise, the association rule is discarded as statistically insignificant.

After determining the data association rules, the confidence factor for each such association rule is determined where the confidence factor for the association rule A→B represents the likelihood or probability of B given A.

For example, given a set of data items A and a set of data items B where the intersection of A and B is empty, the confidence factor Conf(R, P) of rule R=(A→B) in partition P is S(A∪B)/S(A), where S(X) is the support of X in P. Such association rules, together with the confidence factor for each such rule, are stored in a history.

In order to determine significant patterns or trends over time, the association rules and confidence factors for the current data set are first determined. The confidence factors for each association rule are then compared with the confidence factors for the corresponding association rule, if present in the history, from previous data partitions. A change in confidence of a particular association rule, such that the probability that the change occurred by chance is less than some predefined percentage (e.g. 5%) as determined by a chi-square test of two proportions or some other applicable statistical test, generates an alert signal to the operator. Following analysis of all of the data in the current data set, the alert signals are displayed or otherwise conveyed to the operator who then takes whatever action is appropriate.

In the preferred embodiment of the invention, the alerts are clustered into events prior to displaying such alerts to the operator. Such alert clustering groups descendant association rules with the parent association rule into an event. An association rule A1→B1 is defined as a descendant of association rule A2→B2 if the set of items in A2 is contained in A1 and, likewise, B2 ⊂ B1. Descendancy also requires that a descendant association rule accounts for the change detected in the parent association rule.

A primary advantage of the present invention is that it rapidly identifies related clusters of high support association rules whose confidences change significantly over time. Using traditional methods, these clusters might be overlooked.

BRIEF DESCRIPTION OF THE DRAWING

A better understanding of the present invention will be had upon reference to the following detailed description when read in conjunction with the accompanying drawing, wherein like reference characters refer to like parts throughout the several views, and in which:

FIG. 1 is a flow chart illustrating a portion of the preferred embodiment of the present invention; and

FIG. 2 is a flow chart illustrating another portion of the preferred embodiment of the present invention.

DETAILED DESCRIPTION OF A PREFERRED EMBODIMENT OF THE PRESENT INVENTION

Temporal or time slice health care data is typically collected in data sets, each data set P_(n) representing a plurality of individual records where n represents the particular data set. Each individual record, in turn, contains data items a, b, c etc. The individual data items, however, are mutually exclusive of each other. Frequent sets A, B, each containing sets of items, are also defined. For example, frequent set A may contain items a, b and c while frequent set B may contain items d, e and f. Thus, an item X cannot be in both frequent sets A and B for a given rule A→B.

Using conventional methods, association rules, e.g. R=A→B, are then determined for all of the data records in the data partition P that optionally initially pass user defined rule templates. Such association rules R, however, are only considered when there is sufficient support, i.e. the items of A are found together in a sufficient number of data records, in a particular data set P. Otherwise, the association rule is simply discarded as being statistically insignificant.
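By way of illustration only, the support test described above may be sketched in Python as follows. The record layout (a frozen set of item strings), the item names, the function names and the `min_support` parameter are assumptions made for this sketch and do not appear in the specification.

```python
from typing import FrozenSet, Sequence

Record = FrozenSet[str]  # assumed layout: one record = a set of item strings


def support(itemset: FrozenSet[str], records: Sequence[Record]) -> int:
    """Count the records in a partition that contain every item of `itemset`."""
    return sum(1 for record in records if itemset <= record)


def has_sufficient_support(antecedent: FrozenSet[str],
                           records: Sequence[Record],
                           min_support: int) -> bool:
    """Keep a rule only if the items of A occur together often enough."""
    return support(antecedent, records) >= min_support


# Hypothetical partition P with four records.
partition = [
    frozenset({"org=X", "src=urine", "drug1=R"}),
    frozenset({"org=X", "src=urine", "drug1=S"}),
    frozenset({"org=X", "src=blood", "drug1=R"}),
    frozenset({"org=Y", "src=urine", "drug1=R"}),
]

A = frozenset({"org=X", "src=urine"})
keep_rule = has_sufficient_support(A, partition, min_support=2)
```

Here the items of A occur together in two records, so a rule with precondition A would be retained at a minimum support of two and discarded at a minimum support of three.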

Once the association rules have been determined for the data set P_(C), a confidence factor Conf(R, P_(C)) is then determined for each association rule R in the data partition P_(C).

For example, assuming association rule R=A→B has sufficient support in P, if the items in B occur in 60% of the data records of P that the items of A occur in, then the probability of B given A, Conf(R, P), is 60%.
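The 60% example may be reproduced with a short Python sketch, offered for illustration only. It computes the confidence factor as S(A∪B)/S(A), the ratio given in the summary. The record layout, the item names and the function names are assumptions of the sketch.

```python
from typing import FrozenSet, Sequence

Record = FrozenSet[str]  # assumed layout: one record = a set of item strings


def support(itemset: FrozenSet[str], records: Sequence[Record]) -> int:
    """S(X): number of records in the partition containing every item of X."""
    return sum(1 for record in records if itemset <= record)


def confidence(antecedent: FrozenSet[str],
               consequent: FrozenSet[str],
               records: Sequence[Record]) -> float:
    """Conf(R, P) = S(A ∪ B) / S(A) for the rule R = A -> B."""
    s_a = support(antecedent, records)
    if s_a == 0:
        raise ValueError("antecedent has no support in this partition")
    return support(antecedent | consequent, records) / s_a


# Hypothetical partition: five records contain A, three of those also contain B.
A = frozenset({"a", "b"})
B = frozenset({"d"})
partition = (
    [frozenset({"a", "b", "d"})] * 3
    + [frozenset({"a", "b", "e"})] * 2
    + [frozenset({"a", "d"})]          # lacks item b, so it does not support A
)

conf = confidence(A, B, partition)     # 3 / 5
```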

Any conventional means may be utilized to identify the association rules, ensure that the association rules have sufficient support to have statistical interest, as well as to calculate the confidence factor Conf(R, P) for each association rule R in the entire data set P.

According to the present invention and as will be shortly described in greater detail, the confidence factors Conf(R, P_(C)) of the current data set P_(C) are then compared with the corresponding confidence factors Conf(R, P_(C-i)) for previous data sets, if such association rule is present in the previous data sets. When the change in the confidence factor for a particular association rule(s) exceeds a predetermined threshold, e.g. 5% variation as determined by a chi-square test of two proportions or some other applicable statistical test, an alert signal is created and stored. Following the analysis of all of the data in the current data set P_(C), the alert signals generated by the comparison of the confidence factors are then displayed to the operator. Furthermore, the alert signals are preferably clustered into events of ancestor and descendant association rules where an association rule A2→B2 is a descendant of the association rule A1→B1 if the set of items A2 is contained in A1 and, likewise, B2 ⊂ B1. Descendancy also requires that a descendant association rule accounts for the change detected in the parent association rule. Such clustering of related ancestor and descendant association rules into events enables more efficient data analysis by the operator.
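One possible form of the chi-square test of two proportions is sketched below in Python, for illustration only. The sketch assumes a Pearson chi-square statistic on a 2×2 table without continuity correction, assumes each confidence factor is kept as a pair of counts (records supporting A∪B, records supporting A), and treats the 5% figure as a significance level. The function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def chi_square_two_proportions(hits_cur: int, n_cur: int,
                               hits_prev: int, n_prev: int) -> Tuple[float, float]:
    """Return (chi2, p_value) for comparing hits_cur/n_cur with hits_prev/n_prev.

    Uses the Pearson statistic for a 2x2 table with one degree of freedom.
    """
    a, b = hits_cur, n_cur - hits_cur      # current partition: B present / absent
    c, d = hits_prev, n_prev - hits_prev   # prior partition:   B present / absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0                    # degenerate table: no evidence of change
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 d.o.f.: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def confidence_changed(hits_cur: int, n_cur: int,
                       hits_prev: int, n_prev: int,
                       alpha: float = 0.05) -> bool:
    """True when the change is unlikely (probability < alpha) to be due to chance."""
    _, p_value = chi_square_two_proportions(hits_cur, n_cur, hits_prev, n_prev)
    return p_value < alpha


# Hypothetical counts: confidence 30/50 now versus 15/50 in a prior partition.
chi2, p = chi_square_two_proportions(30, 50, 15, 50)
alert = confidence_changed(30, 50, 15, 50)
```

Working from counts rather than from the bare confidence percentages is a design choice of the sketch: the test needs the sample sizes, so the same change in confidence is weighed differently for rules with few and with many supporting records.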

With reference now to FIG. 1, a flow chart in which the data in the current data set P_(C) (C=current) is compared with the association rules and confidence factors in prior data sets P_(C-1), P_(C-2), P_(C-3) . . . is there shown. At step 100, the association rules R and confidence factors Conf(R, P_(C-i)) are first read from the history where i equals a counter for the prior data sets. The history is conventionally stored on magnetic media of any conventional type.

At step 102 the association rules R as well as the confidence factors Conf(R, P_(C)) are then determined for the data contained in the current data set or current partition P_(C). Step 102 then exits to step 104 which updates the history for the current data set P_(C).

Step 104 then branches to step 106 which sets the variable i equal to 1. The variable i is used, as will shortly become evident, to iterate through the association rules and confidence factors in prior partitions. Step 106 then branches to step 108.

Assuming that there are n association rules R having sufficient support in the current data set P_(C), step 108 then identifies the first association rule R₁ (R_(n) where n=1) in the current data set P_(C) and determines if the association rule R₁ was identified in the prior partition P_(C-i). If not, step 108 branches to step 110 which increments the value of i and then branches back to step 108. Consequently, the loop represented by step 108 and step 110 iteratively searches the confidences of association rule R₁ for all of the previous partitions P_(C-i).

Conversely, assuming that the association rule R₁, i.e. the first association rule in the current data set P_(C), is also present in the data partition P_(C-i), step 108 instead branches to step 112 which determines whether the change between the confidence factor for the association rule R₁ in the current data set P_(C), Conf(R₁, P_(C)), and the corresponding confidence factor in the earlier data set P_(C-i), Conf(R₁, P_(C-i)), is greater than a predetermined threshold T_(H) as determined by a chi-square test of two proportions. If not, step 112 then branches to step 110 which increments the counter i for the data set P_(C-i) and then returns to step 108.

Conversely, if the confidence factor for the association rule R₁ has changed more than the threshold amount T_(H) as determined by a chi-square test of two proportions, step 112 instead branches to step 114 which both generates and stores an alert. Step 114 then branches to step 116 which determines if all of the rules R_(n) in the current data set P_(C) have been analyzed in the previously described fashion. If not, step 116 branches to step 118 which increments the counter n to the next association rule R_(n). Step 118 then branches to step 106 where the above-identified process is repeated.

As can thus be seen, with the iterative algorithm depicted in FIG. 1, the confidence factor Conf(R_(n), P_(C)) for each association rule R_(n) in the current set P_(C) is compared with the confidence factor Conf(R_(n), P_(C-i)) for the corresponding association rule in the previous data sets P_(C-i). If the confidence factor changes by more than a preset threshold T_(H) as determined by a chi-square test of two proportions, an appropriate alert is generated.

Following complete analysis of all of the data in the current data set P_(C), step 116 branches to step 120 which presents the results to the operator.
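The iteration of FIG. 1 may be sketched in Python as follows, for illustration only. The sketch assumes that each partition is held as a mapping from a rule to a pair of counts (records supporting A∪B, records supporting A), that the history is a list of prior partitions ordered from P_(C-1) backwards, that the search for a rule stops when the history is exhausted, and that the threshold T_(H) is a 5% significance level. The names `find_alerts`, `p_value` and `alpha` are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple

Counts = Tuple[int, int]  # assumed: (records supporting A ∪ B, records supporting A)


def p_value(cur: Counts, prev: Counts) -> float:
    """Pearson chi-square test of two proportions (1 d.o.f., no correction)."""
    a, b = cur[0], cur[1] - cur[0]
    c, d = prev[0], prev[1] - prev[0]
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = (a + b + c + d) * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))


def find_alerts(current: Dict[Hashable, Counts],
                history: Sequence[Dict[Hashable, Counts]],
                alpha: float = 0.05) -> List[Tuple[Hashable, int]]:
    """Return (rule, i) for each rule whose confidence changed versus P_(C-i)."""
    alerts = []
    for rule, cur_counts in current.items():            # steps 116 and 118: next rule
        for i, prior in enumerate(history, start=1):    # steps 106 and 110: i = 1, 2, ...
            prev_counts = prior.get(rule)               # step 108: rule in P_(C-i)?
            if prev_counts is None:
                continue
            if p_value(cur_counts, prev_counts) < alpha:  # step 112: significant change?
                alerts.append((rule, i))                # step 114: generate and store alert
                break                                   # then move on to the next rule
    return alerts


# Hypothetical data: three rules in P_(C) and two prior partitions.
current = {"R1": (30, 50), "R2": (20, 40), "R3": (10, 20)}
history = [
    {"R2": (21, 40)},                     # P_(C-1)
    {"R1": (15, 50), "R2": (19, 40)},     # P_(C-2)
]
alerts = find_alerts(current, history)
```

In this example R1 is absent from P_(C-1) but differs significantly from P_(C-2), R2 shows no significant change against either prior partition, and R3 has no history, so a single alert is stored for R1.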

With reference now to FIG. 2, in order to provide more efficient data analysis of the alerts generated at step 114 (FIG. 1) the alerts are clustered. As best shown in step 122, all of the alerts generated at step 114 (FIG. 1) are first marked as not in an event. Step 122 then branches to step 124. At step 124 an alert X is selected such that the alert X is not in an event and that every descendant of X is in an event or that X has no descendants. Step 124 then branches to step 126. Step 126 then marks X as an event X′ and then branches to step 128. Step 128 then identifies each alert Y that is a descendant of X and then branches to step 130.

Step 130 then determines if Y is in the same event as X. If so, step 130 branches to step 132 which marks alert Y as in event X′. Otherwise, step 130 advances to the next alert Y and then branches back to step 128 where the above process is repeated.

After each alert Y has been examined for the first selected alert X, step 132 branches to step 134 which tests whether there is an alert that is not in an event. If so, step 134 branches to step 124 where the above-identified process is repeated. Otherwise, step 134 exits to step 120 (FIG. 1).
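The clustering of FIG. 2 may be sketched in Python as follows, for illustration only. The sketch reads the flow chart literally, so a descendant alert that already forms its own event is also listed in the event of its parent. The descendant relation and the test of step 130 are passed in as functions because the sketch does not attempt to implement them; `cluster_alerts`, `is_descendant` and `same_event` are hypothetical names, and the alerts are assumed to be unique and hashable.

```python
from typing import Callable, Dict, Hashable, List, Sequence, Set


def cluster_alerts(alerts: Sequence[Hashable],
                   is_descendant: Callable[[Hashable, Hashable], bool],
                   same_event: Callable[[Hashable, Hashable], bool] = lambda y, x: True,
                   ) -> Dict[Hashable, List[Hashable]]:
    """Group alerts into events keyed by the alert X that heads each event."""
    in_event: Set[Hashable] = set()             # step 122: no alert is in an event yet
    events: Dict[Hashable, List[Hashable]] = {}
    while any(a not in in_event for a in alerts):           # step 134
        # Step 124: pick X not in an event whose descendants (if any) all are.
        x = next((a for a in alerts
                  if a not in in_event
                  and all(d in in_event for d in alerts if is_descendant(d, a))),
                 None)
        if x is None:
            break                               # guard against a cyclic relation
        events[x] = [x]                         # step 126: X heads event X'
        in_event.add(x)
        for y in alerts:                        # step 128: each descendant Y of X
            if is_descendant(y, x) and same_event(y, x):    # step 130
                events[x].append(y)             # step 132: Y is in event X'
                in_event.add(y)
    return events


# Hypothetical alerts: one parent rule, one descendant of it, one unrelated rule.
descendants = {"R_parent": {"R_child"}, "R_child": set(), "R_other": set()}
events = cluster_alerts(
    ["R_parent", "R_child", "R_other"],
    is_descendant=lambda y, x: y in descendants[x],
)
```

Because step 124 only selects an alert once all of its descendants are in events, the alerts are processed from the most specific rules upward, and the loop ends after at most one pass per alert.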

From the foregoing, it can be seen that the present invention provides a novel method of analyzing temporal data such as the type of data used in health surveillance. No unnecessary limitation, however, should be drawn therefrom since the data analysis method of the present invention may be used for any type of temporal data.

A primary advantage of the present invention is that, unlike previous methods which primarily analyze data associations having high support levels, and which are thus oftentimes uninteresting, the present invention instead identifies changes in confidence factors of the various identified association rules. Since it is the confidence factor, rather than the gross number of data records, that is monitored, changes in both common association rules and less common association rules are equally identified and presented to the operator as an alert.

Having described my invention, however, many modifications thereto will become apparent to those skilled in the art to which it pertains without deviation from the spirit of the invention as defined by the scope of the appended claims.

We claim:
 1. A method for analyzing sets of temporal data, each data set of temporal data comprising a plurality of records collected at a time period unique to each said set, each record having a plurality of data items, said method comprising the steps of: creating data association rules for at least a plurality of sequential sets, each association rule representing data records having at least some common data items, determining the confidence factor for each such association, storing said association rules and said confidence factors for said at least a plurality of sequential data sets in data partitions, comparing said confidence factors for each association rule a selected data partition with the corresponding confidence factors for at least one other data partition, generating an output signal whenever the change of confidence factor for said selected data set varies from the corresponding confidence factor for said at least one other data set exceeds a selected threshold value.
 2. The invention as defined in claim 1 wherein said creating data association rules step further comprises the step of creating said data association rule only when the number of data records having common data elements exceeds a preset number.
 3. The invention as defined in claim 1 wherein said data set comprises health care data.
 4. The invention as defined in claim 1 and comprising the further step of clustering output signals corresponding to association rules having common data items.